Big Mobility Data Analytics - Task¶

Objective: Use NY taxi trips and restaurants datasets and combine them spatially, create a visualization and perform EDA
Date: February 20, 2023
Author: Chirag Khatri ckhatri@usc.edu
GitHub

Table of Contents¶

  1. Importing libraries
  2. Reading data
  3. Preliminary Exploratory Data Analysis
    1. Missing values in data
    2. Working with date columns
    3. Univariate Analysis
  4. Linking datasets spatially
    1. Geocoding restaurant addresses
    2. Distance between two [lat, lng] points
    3. Merged dataset - Trips with nearby restaurant
  5. Exploratory Visualization
  6. EDA Questions
    1. Trip Distance
    2. Tipping Rate
  7. Predictive Model

1. Importing libraries ¶

In [34]:
import ast
import json

from geopy import distance

import pandas as pd
import plotly.express as px
import requests

2. Reading data ¶

In [2]:
nyctrips = pd.read_csv("./data/nyctripsdata.csv")
print(nyctrips.info())
nyctrips.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 97986 entries, 0 to 97985
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   VendorID               97986 non-null  int64  
 1   tpep_pickup_datetime   97986 non-null  object 
 2   tpep_dropoff_datetime  97986 non-null  object 
 3   passenger_count        97986 non-null  int64  
 4   trip_distance          97986 non-null  float64
 5   pickup_longitude       97986 non-null  float64
 6   pickup_latitude        97986 non-null  float64
 7   dropoff_longitude      97986 non-null  float64
 8   dropoff_latitude       97986 non-null  float64
 9   fare_amount            97986 non-null  float64
 10  tip_amount             97986 non-null  float64
 11  total_amount           97986 non-null  float64
dtypes: float64(8), int64(2), object(2)
memory usage: 9.0+ MB
None
Out[2]:
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude fare_amount tip_amount total_amount
0 2 2015-01-15 19:05:39 +00:00 2015-01-15 19:23:42 +00:00 1 1.59 -73.993896 40.750111 -73.974785 40.750618 12.0 3.25 17.05
1 2 2015-01-15 19:05:39 +00:00 2015-01-15 19:32:00 +00:00 1 2.38 -73.976425 40.739811 -73.983978 40.757889 16.5 4.38 22.68
2 2 2015-01-15 19:05:40 +00:00 2015-01-15 19:21:00 +00:00 5 2.83 -73.968704 40.754246 -73.955124 40.786858 12.5 0.00 14.30
3 2 2015-01-15 19:05:40 +00:00 2015-01-15 19:28:18 +00:00 5 8.33 -73.863060 40.769581 -73.952713 40.785782 26.0 8.08 41.21
4 2 2015-01-15 19:05:41 +00:00 2015-01-15 19:20:36 +00:00 1 2.37 -73.945541 40.779423 -73.980850 40.786083 11.5 0.00 13.30
In [3]:
nycinspections = pd.read_csv("./data/DOHMH_New_York_City_Restaurant_Inspection_Results.csv")
print(nycinspections.info())
nycinspections.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 399918 entries, 0 to 399917
Data columns (total 18 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   CAMIS                  399918 non-null  int64  
 1   DBA                    399559 non-null  object 
 2   BORO                   399918 non-null  object 
 3   BUILDING               399809 non-null  object 
 4   STREET                 399909 non-null  object 
 5   ZIPCODE                399909 non-null  float64
 6   PHONE                  399913 non-null  object 
 7   CUISINE DESCRIPTION    399918 non-null  object 
 8   INSPECTION DATE        399918 non-null  object 
 9   ACTION                 398783 non-null  object 
 10  VIOLATION CODE         393414 non-null  object 
 11  VIOLATION DESCRIPTION  392939 non-null  object 
 12  CRITICAL FLAG          399918 non-null  object 
 13  SCORE                  376704 non-null  float64
 14  GRADE                  195413 non-null  object 
 15  GRADE DATE             192820 non-null  object 
 16  RECORD DATE            399918 non-null  object 
 17  INSPECTION TYPE        398783 non-null  object 
dtypes: float64(2), int64(1), object(15)
memory usage: 54.9+ MB
None
Out[3]:
CAMIS DBA BORO BUILDING STREET ZIPCODE PHONE CUISINE DESCRIPTION INSPECTION DATE ACTION VIOLATION CODE VIOLATION DESCRIPTION CRITICAL FLAG SCORE GRADE GRADE DATE RECORD DATE INSPECTION TYPE
0 40511702 NOTARO RESTAURANT MANHATTAN 635 SECOND AVENUE 10016.0 2126863400 Italian 06/15/2015 Violations were cited in the following area(s). 02B Hot food item not held at or above 140º F. Critical 30.0 NaN NaN 08/28/2017 Cycle Inspection / Initial Inspection
1 40511702 NOTARO RESTAURANT MANHATTAN 635 SECOND AVENUE 10016.0 2126863400 Italian 11/25/2014 Violations were cited in the following area(s). 20F Current letter grade card not posted. Not Critical NaN NaN NaN 08/28/2017 Administrative Miscellaneous / Initial Inspection
2 50046354 VITE BAR QUEENS 2507 BROADWAY 11106.0 3478134702 Italian 10/03/2016 Violations were cited in the following area(s). 10F Non-food contact surface improperly constructe... Not Critical 2.0 NaN NaN 08/28/2017 Pre-permit (Operational) / Initial Inspection
3 50061389 TACK'S CHINESE TAKE OUT STATEN ISLAND 11C HOLDEN BLVD 10314.0 7189839854 Chinese 05/17/2017 Violations were cited in the following area(s). 02G Cold food item held above 41º F (smoked fish ... Critical 46.0 NaN NaN 08/28/2017 Pre-permit (Operational) / Initial Inspection
4 41516263 NO QUARTER BROOKLYN 8015 5 AVENUE 11209.0 7187019180 American 03/30/2017 Violations were cited in the following area(s). 04M Live roaches present in facility's food and/or... Critical 18.0 NaN NaN 08/28/2017 Cycle Inspection / Initial Inspection

3. Exploratory Data Analysis ¶

3.A Missing values in data ¶

In [4]:
# Proportion of missing values in columns
(nycinspections.isna().sum() / nycinspections.shape[0]) * 100
Out[4]:
CAMIS                     0.000000
DBA                       0.089768
BORO                      0.000000
BUILDING                  0.027256
STREET                    0.002250
ZIPCODE                   0.002250
PHONE                     0.001250
CUISINE DESCRIPTION       0.000000
INSPECTION DATE           0.000000
ACTION                    0.283808
VIOLATION CODE            1.626333
VIOLATION DESCRIPTION     1.745108
CRITICAL FLAG             0.000000
SCORE                     5.804690
GRADE                    51.136733
GRADE DATE               51.785116
RECORD DATE               0.000000
INSPECTION TYPE           0.283808
dtype: float64

Finding:

  • More than half of the values in columns Grade and Grade Date contain NA, in NYC inspections data. Although Grade is not a useful feature for the specific task in this notebook, 50% NA values stand out.
  • No NA values in NYC trips data
In [6]:
# Check if Inspection date is valid where Grade is NA
invalid_inspec_date = nycinspections[(nycinspections['INSPECTION DATE'] == '01/01/1900') | (nycinspections['INSPECTION DATE'].isna())]
invalid_inspec_date.groupby(['GRADE'], dropna=False).size()
Out[6]:
GRADE
A         1
NaN    1134
dtype: int64
In [7]:
# Proportion of rows where Grade is NA and Inspection Date is invalid
(invalid_inspec_date.shape[0] / nycinspections['GRADE'].isna().sum()) * 100
Out[7]:
0.5549986552896017

Finding:

  • Only <1% rows with missing Grade have an invalid Inspection Date

3.B Working with date columns ¶

NYC Trips data

In [8]:
# Fix datetime column type
# Current type
print(nyctrips['tpep_pickup_datetime'][0], type(nyctrips['tpep_pickup_datetime'][0]))

nyctrips['tpep_pickup_datetime'] = pd.to_datetime(nyctrips['tpep_pickup_datetime'])
nyctrips['tpep_dropoff_datetime'] = pd.to_datetime(nyctrips['tpep_dropoff_datetime'])

# Date column type after conversion
print(nyctrips['tpep_pickup_datetime'][0], type(nyctrips['tpep_pickup_datetime'][0]))
2015-01-15 19:05:39 +00:00 <class 'str'>
2015-01-15 19:05:39+00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
In [9]:
# First and Last dates
print("NYC TRIPS \nFirst PICKUP: {} \nLast PICKUP: {} \nFirst DROPOFF: {} \nLast DROPOFF: {} ".format(min(nyctrips['tpep_pickup_datetime']),
                                                                                                     max(nyctrips['tpep_pickup_datetime']),
                                                                                                     min(nyctrips['tpep_dropoff_datetime']),
                                                                                                     max(nyctrips['tpep_dropoff_datetime']),
                                                                                                    ))
NYC TRIPS 
First PICKUP: 2015-01-15 00:01:32+00:00 
Last PICKUP: 2015-01-15 23:59:59+00:00 
First DROPOFF: 2015-01-15 00:02:31+00:00 
Last DROPOFF: 2015-01-16 23:41:21+00:00 

NYC Inspections data

In [10]:
# Fix datetime column type
# Current type
print(nycinspections['INSPECTION DATE'][0], type(nycinspections['INSPECTION DATE'][0]))

nycinspections['INSPECTION DATE'] = pd.to_datetime(nycinspections['INSPECTION DATE'])

# Date column type after conversion
print(nycinspections['INSPECTION DATE'][0], type(nycinspections['INSPECTION DATE'][0]))
06/15/2015 <class 'str'>
2015-06-15 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
In [11]:
# First and Last dates
print("NYC INSPECTIONS \nFirst Inspection: {} \nLast Inspection: {}".format(min(nycinspections['INSPECTION DATE']), max(nycinspections['INSPECTION DATE']),))
NYC INSPECTIONS 
First Inspection: 1900-01-01 00:00:00 
Last Inspection: 2017-08-26 00:00:00

3.C Univariate Analysis ¶

NYC Trips data

In [13]:
# Percentile distribution of passenger_count
nyctrips['passenger_count'].quantile([0.25, 0.5, 0.75, 0.9, 0.95, 0.99])
Out[13]:
0.25    1.0
0.50    1.0
0.75    2.0
0.90    4.0
0.95    5.0
0.99    6.0
Name: passenger_count, dtype: float64
In [14]:
# Proportion of taxi rides with a single passenger
(nyctrips[nyctrips['passenger_count']==1].shape[0] / nyctrips['passenger_count'].shape[0]) * 100
Out[14]:
72.38278937807442
In [15]:
fig = px.histogram(nyctrips, x='passenger_count', title="Number of passengers in a taxi trip")
fig.show()
In [16]:
# Percentile distribution of trip_distance (in miles)
nyctrips['trip_distance'].quantile([0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99])
Out[16]:
0.05     0.5000
0.10     0.6500
0.25     1.0000
0.50     1.6700
0.75     3.0700
0.90     6.3000
0.95    10.0000
0.99    18.1715
Name: trip_distance, dtype: float64
In [17]:
fig = px.histogram(nyctrips[nyctrips['trip_distance']<20],
                   x='trip_distance',
                   color='VendorID',
                   title="Distance of a taxi trip in miles (capped at 20 miles/99th percentile)")
fig.show()
In [18]:
# Percentile distribution of total_amount (in USD)
nyctrips['total_amount'].quantile([0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99])
Out[18]:
0.05     5.80
0.10     6.80
0.25     8.30
0.50    11.60
0.75    17.15
0.90    27.80
0.95    42.15
0.99    69.60
Name: total_amount, dtype: float64
In [19]:
fig = px.histogram(nyctrips[(nyctrips['total_amount'] < 70) & (nyctrips['total_amount'] > 0)],
                   x='total_amount',
                   color='VendorID',
                   title="Total amount (incl tips) for a taxi ride")
fig.show()
In [20]:
# trip duration
nyctrips['duration_mins'] = (nyctrips['tpep_dropoff_datetime'] - nyctrips['tpep_pickup_datetime']).astype('timedelta64[m]')
nyctrips['duration_mins'].quantile([0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99])
Out[20]:
0.05     3.0
0.10     4.0
0.25     6.0
0.50    10.0
0.75    16.0
0.90    25.0
0.95    32.0
0.99    55.0
Name: duration_mins, dtype: float64
In [21]:
fig = px.histogram(nyctrips[(nyctrips['duration_mins'] < 60) & (nyctrips['duration_mins'] > 0)],
                   x='duration_mins',
                   color='VendorID',
                   title="Duration of a taxi ride (in minutes) capped at 60 mins/99th percentile")
fig.show()
In [22]:
# busy pickup times during the day
nyctrips['pickup_hour'] = nyctrips['tpep_pickup_datetime'].dt.hour
nyctrips['pickup_hour'].quantile([0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99])
Out[22]:
0.05     1.0
0.10     6.0
0.25     9.0
0.50    15.0
0.75    20.0
0.90    22.0
0.95    23.0
0.99    23.0
Name: pickup_hour, dtype: float64
In [23]:
fig = px.histogram(nyctrips,
                   x='pickup_hour',
                   title="Number of taxi pickups per hour throughout the day")
fig.show()

Findings:

  • Majority of taxi trips (72%) have only 1 passenger. Only about 10% of taxi trips have 4 or more passengers
  • 25% of the taxi trips cover a distance of 1 mile or less. 75% cover 3 miles or less. Only about 10% are longer than 10 miles
  • About half of the taxi trips last 10 minutes or less. Only about 10% of the total trips are longer than 25 minutes
  • Evening hours (7 pm to midnight) are the busiest for taxi pickups, with 10-11 pm being the busiest with ~7.3K pickups. During daytime, 8 am and 4 pm see a considerable rush of taxi trip pickups
  • There is no difference in the patterns between the two vendors in the data (seen in blue and red in the above graphs)

NYC Inspections

In [25]:
fig = px.histogram(nycinspections,
                   x='BORO',
                   title="Number of inspection records across boroughs in NYC",).update_xaxes(categoryorder='total descending')
fig.show()
In [26]:
count_of_restaurants = nycinspections.groupby(['BORO']).agg({'CAMIS': 'nunique'}).reset_index().sort_values('CAMIS', ascending=False)
# total number of restaurants
count_of_restaurants['CAMIS'].sum()
Out[26]:
26505
In [27]:
fig = px.bar(count_of_restaurants,
             x='BORO',
             y='CAMIS',
             title="Number of restaurants across boroughs in NYC",)
fig.show()

Findings:

  • Manhattan has the most number of restaurants (>40% of the total 26.5K restaurants in the NYC Inspections data), followed by Brooklyn and Queens. The number of inspection records follow the same order

4. Linking datasets spatially ¶

4.A Geocoding restaurant addresses ¶

Geocode restaurant addresses to get their lat, lng coordinates using Google Maps API

In [28]:
# Get lat, lng coordinates for restaurant addresses using Google Geocoding API

def geocode_address(inp_address, api_key):
    """
    Converts the address string to lat, long coordinates, given API key for google cloud
    """
    geocode_url = "https://maps.googleapis.com/maps/api/geocode/json"

    # address = building + street + zip
    url_encoded_address = inp_address
    # url_encoded_address = requests.utils.quote(inp_address)
    
    payload = {'address': url_encoded_address,
               'key': api_key,
               'region': 'US'}

    r = requests.get(geocode_url, params=payload)
    # print(r.url)
    
    if r.status_code == 200:
        # print(response.text)
        r_json = r.json()
        address_str = r_json['results'][0]['formatted_address']
        address_lat_lng = r_json['results'][0]['geometry']['location']
        return address_str, address_lat_lng

# read API key
with open("./secret/gcloud_maps_api_key.txt") as api_file:
    api_key = api_file.read()

# address_str, address_lat_lng = geocode_address('635 second avenue 10016', api_key)
# print(address_str, address_lat_lng)
In [29]:
# get unique restaurants
nyc_restaurants = nycinspections[['CAMIS', 'DBA', 'BORO', 'BUILDING', 'STREET', 'ZIPCODE', 'CUISINE DESCRIPTION']].drop_duplicates(ignore_index=True)
print(nyc_restaurants.shape)

# drop restaurants with NA addresses
nyc_restaurants.dropna(axis=0, subset=['BUILDING', 'STREET', 'ZIPCODE'], inplace=True)
print(nyc_restaurants.shape)
(26505, 7)
(26485, 7)
In [30]:
# create an address column with 'NY' string added for a more accurate Geocoding
nyc_restaurants['ADDRESS'] = nyc_restaurants['BUILDING'] + ' ' + nyc_restaurants['STREET'] + ' NY ' + nyc_restaurants['ZIPCODE'].astype('int').astype('str')
In [31]:
# Run geocoding for all restaurants - 26.5K  sequential requests to Geocoding API
# nyc_restaurants['address_lat_lng'] = nyc_restaurants.apply(lambda row: geocode_address(row['ADDRESS'], api_key), axis=1)
In [32]:
# save geocoding results
# nyc_restaurants.to_csv("./nyc_restaurants_lat_lng_geocoding_result.csv", index=False)
nyc_restaurants = pd.read_csv("./nyc_restaurants_lat_lng_geocoding_result.csv")
In [36]:
nyc_restaurants['address_lat_lng'] = nyc_restaurants.apply(lambda row: ast.literal_eval(row['address_lat_lng']), axis=1)
nyc_restaurants['geo_address'] = nyc_restaurants.apply(lambda row: row['address_lat_lng'][0], axis=1)
nyc_restaurants['lat'] = nyc_restaurants.apply(lambda row: row['address_lat_lng'][1]['lat'], axis=1)
nyc_restaurants['lng'] = nyc_restaurants.apply(lambda row: row['address_lat_lng'][1]['lng'], axis=1)

# remove redundant column
nyc_restaurants = nyc_restaurants.drop("address_lat_lng", axis=1)
In [37]:
nyc_restaurants[:3] # lat and lng columns
Out[37]:
CAMIS DBA BORO BUILDING STREET ZIPCODE CUISINE DESCRIPTION ADDRESS geo_address lat lng
0 40511702 NOTARO RESTAURANT MANHATTAN 635 SECOND AVENUE 10016.0 Italian 635 SECOND AVENUE NY 10016 635 2nd Ave, New York, NY 10016, USA 40.745295 -73.975777
1 50046354 VITE BAR QUEENS 2507 BROADWAY 11106.0 Italian 2507 BROADWAY NY 11106 2507 Broadway, New York, NY 10025, USA 40.793126 -73.973306
2 50061389 TACK'S CHINESE TAKE OUT STATEN ISLAND 11C HOLDEN BLVD 10314.0 Chinese 11C HOLDEN BLVD NY 10314 11 Holden Blvd, Staten Island, NY 10314, USA 40.604667 -74.121197

Add Lunch/Dinner labels

Add lunchtime (11.30 am to 2 pm), dinnertime (5 pm to 9 pm) labels to NYC Trips data based on dropoff times. An assumption is made to choose dropoff times instead of pickup times, because trips end at the restaurant at the dropoff time

In [38]:
# Add lunchtime (11.30 am to 2 pm), dinnertime (5 pm to 9 pm) labels to NYC Trips data based on dropoff times
lunch_start, lunch_end = "11:30:00", "14:00:00"
dinner_start, dinner_end = "17:00:00", "21:00:00"

# convert to datetime.time format
lunch_start, lunch_end, dinner_start, dinner_end = map(lambda x: pd.to_datetime(x, format='%H:%M:%S').time(), [lunch_start, lunch_end, dinner_start, dinner_end])

lunchtime = (nyctrips['tpep_dropoff_datetime'].dt.time >= lunch_start) & (nyctrips['tpep_dropoff_datetime'].dt.time <= lunch_end)
dinnertime = (nyctrips['tpep_dropoff_datetime'].dt.time >= dinner_start) & (nyctrips['tpep_dropoff_datetime'].dt.time <= dinner_end)

nyctrips.loc[lunchtime == True, 'meal'] = 'lunchtime'
nyctrips.loc[dinnertime == True, 'meal'] = 'dinnertime'
In [39]:
nyctrips[:3] # lunchtime and dinnertime labels in 'meal' column
Out[39]:
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude fare_amount tip_amount total_amount duration_mins pickup_hour meal
0 2 2015-01-15 19:05:39+00:00 2015-01-15 19:23:42+00:00 1 1.59 -73.993896 40.750111 -73.974785 40.750618 12.0 3.25 17.05 18.0 19 dinnertime
1 2 2015-01-15 19:05:39+00:00 2015-01-15 19:32:00+00:00 1 2.38 -73.976425 40.739811 -73.983978 40.757889 16.5 4.38 22.68 26.0 19 dinnertime
2 2 2015-01-15 19:05:40+00:00 2015-01-15 19:21:00+00:00 5 2.83 -73.968704 40.754246 -73.955124 40.786858 12.5 0.00 14.30 15.0 19 dinnertime

4.B Distance between two (lat, lng) points ¶

Two commonly used methods of measuring distances between two coordinate points on the Earth are using:

  • Great Circle distance: Assumes Earth is a sphere with a radius of approx 6371 km. This is fast to compute but is less accurate.
  • Geodesic distance: Assumes Earth to be ellipsoidal and gives more accurate results than Great Circle distance, but is slower to compute. Karney (2013).

The package geopy provides implementations of both methods of computing distances

Finding the nearest restaurant for each Lunch, Dinner trip

In [41]:
# Subset of lunch and dinner taxi trips
nyctrips_meal = nyctrips[nyctrips['meal'].isna() == False]
nyctrips_meal.shape
Out[41]:
(33479, 15)
In [42]:
nyc_restaurants.reset_index(drop=True, inplace=True)
nyctrips_meal.reset_index(drop=True, inplace=True)
In [44]:
%%time
def get_nearest_restaurant(lat, lng):
    """
    Returns the nearest restaurant (within a radius of 50m) from the NYC Inspections data, to the given (lat, lng) point.
    Geodesic distance is measured in metres using geopy.distance
    """
    print("Comparing lat {}, lng {} with all restaurants".format(lat, lng))
    trip_restaurant_distances = pd.Series(distance.great_circle((lat, lng), (r_lat, r_lng)).m for (r_lat, r_lng) in zip(nyc_restaurants['lat'], nyc_restaurants['lng'])) 
    # print(trip_restaurant_distances)
    min_restaurant_distance_idx = trip_restaurant_distances.argmin()
    min_restaurant_distance = trip_restaurant_distances[min_restaurant_distance_idx]
    # print(trip_restaurant_distances)
    return min_restaurant_distance_idx, min_restaurant_distance

# nyctrips_meal['nearest_rest'] = nyctrips_meal.apply(lambda row: get_nearest_restaurant(row['dropoff_latitude'], row['dropoff_longitude']), axis=1)

## split restaurant info column into two separate columns
# nyctrips_meal['nearest_rest_idx'] = nyctrips_meal.apply(lambda row: row['nearest_rest'][0], axis=1) 
# nyctrips_meal['nearest_rest_dist_m'] = nyctrips_meal.apply(lambda row: row['nearest_rest'][1], axis=1)
CPU times: total: 0 ns
Wall time: 0 ns
In [45]:
# nyctrips_meal.to_csv("./nyc_trips_nearest_restaurant_result.csv", index=False)
nyctrips_meal = pd.read_csv("./nyc_trips_nearest_restaurant_result.csv")
In [46]:
rest_in_50_m = nyctrips_meal['nearest_rest_dist_m'] < 51
nyctrips_meal.loc[rest_in_50_m == True, 'rest_idx_in_50m'] = nyctrips_meal['nearest_rest_idx']
In [47]:
# Proportion of trips during meal times (lunch or dinner) that were not to a nearby restaurant (within 50m)
(nyctrips_meal['rest_idx_in_50m'].isna().sum() / nyctrips_meal.shape[0]) * 100
Out[47]:
28.53430508677081

Findings:

  • About 29% of trips during mealtime (lunch or dinner) had no nearby restaurants within 50 metres of their taxi dropoff point
  • These trips would be removed in the subsequent analysis
In [49]:
# remove tripts without a nearby restaunt
nyctrips_meal.dropna(subset=['rest_idx_in_50m'], axis=0, inplace=True)

# drop trips with negative fare
nyctrips_meal = nyctrips_meal[nyctrips_meal['fare_amount']>0]

nyctrips_meal = nyctrips_meal.reset_index(drop=True)
# nyctrips_meal.info()
In [51]:
nyctrips_meal[:2]
Out[51]:
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude fare_amount tip_amount total_amount duration_mins pickup_hour meal nearest_rest nearest_rest_idx nearest_rest_dist_m rest_idx_in_50m
0 2 2015-01-15 19:05:39+00:00 2015-01-15 19:32:00+00:00 1 2.38 -73.976425 40.739811 -73.983978 40.757889 16.5 4.38 22.68 26 19 dinnertime (11866, 2.631334813536103) 11866 2.631335 11866.0
1 2 2015-01-15 19:05:41+00:00 2015-01-15 19:20:22+00:00 2 7.13 -73.874458 40.774010 -73.952377 40.718590 21.5 4.50 27.80 14 19 dinnertime (10855, 46.96017724326131) 10855 46.960177 10855.0
In [52]:
nyc_restaurants[:2]
Out[52]:
CAMIS DBA BORO BUILDING STREET ZIPCODE CUISINE DESCRIPTION ADDRESS geo_address lat lng
0 40511702 NOTARO RESTAURANT MANHATTAN 635 SECOND AVENUE 10016.0 Italian 635 SECOND AVENUE NY 10016 635 2nd Ave, New York, NY 10016, USA 40.745295 -73.975777
1 50046354 VITE BAR QUEENS 2507 BROADWAY 11106.0 Italian 2507 BROADWAY NY 11106 2507 Broadway, New York, NY 10025, USA 40.793126 -73.973306

4.C Merged dataset - Trips with nearby restaurant ¶

In [53]:
# get restaurant details
nyctrips_meal_rest = pd.merge(nyctrips_meal, nyc_restaurants, how='left', left_on=['rest_idx_in_50m'], right_index=True, validate='m:1')
In [54]:
print(nyctrips_meal_rest.info())
nyctrips_meal_rest.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23917 entries, 0 to 23916
Data columns (total 30 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   VendorID               23917 non-null  int64  
 1   tpep_pickup_datetime   23917 non-null  object 
 2   tpep_dropoff_datetime  23917 non-null  object 
 3   passenger_count        23917 non-null  int64  
 4   trip_distance          23917 non-null  float64
 5   pickup_longitude       23917 non-null  float64
 6   pickup_latitude        23917 non-null  float64
 7   dropoff_longitude      23917 non-null  float64
 8   dropoff_latitude       23917 non-null  float64
 9   fare_amount            23917 non-null  float64
 10  tip_amount             23917 non-null  float64
 11  total_amount           23917 non-null  float64
 12  duration_mins          23917 non-null  int64  
 13  pickup_hour            23917 non-null  int64  
 14  meal                   23917 non-null  object 
 15  nearest_rest           23917 non-null  object 
 16  nearest_rest_idx       23917 non-null  int64  
 17  nearest_rest_dist_m    23917 non-null  float64
 18  rest_idx_in_50m        23917 non-null  float64
 19  CAMIS                  23917 non-null  int64  
 20  DBA                    23768 non-null  object 
 21  BORO                   23917 non-null  object 
 22  BUILDING               23917 non-null  object 
 23  STREET                 23917 non-null  object 
 24  ZIPCODE                23917 non-null  float64
 25  CUISINE DESCRIPTION    23917 non-null  object 
 26  ADDRESS                23917 non-null  object 
 27  geo_address            23917 non-null  object 
 28  lat                    23917 non-null  float64
 29  lng                    23917 non-null  float64
dtypes: float64(13), int64(6), object(11)
memory usage: 5.5+ MB
None
Out[54]:
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude fare_amount ... DBA BORO BUILDING STREET ZIPCODE CUISINE DESCRIPTION ADDRESS geo_address lat lng
0 2 2015-01-15 19:05:39+00:00 2015-01-15 19:32:00+00:00 1 2.38 -73.976425 40.739811 -73.983978 40.757889 16.5 ... MARGON RESTAURANT MANHATTAN 136 WEST 46 STREET 10036.0 Latin (Cuban, Dominican, Puerto Rican, South &... 136 WEST 46 STREET NY 10036 136 W 46th St, New York, NY 10036, USA 40.757872 -73.984000
1 2 2015-01-15 19:05:41+00:00 2015-01-15 19:20:22+00:00 2 7.13 -73.874458 40.774010 -73.952377 40.718590 21.5 ... OVER THE EIGHT BROOKLYN 594 UNION AVENUE 11211.0 American 594 UNION AVENUE NY 11211 594 Union Ave, Brooklyn, NY 11211, USA 40.718238 -73.952069
2 2 2015-01-15 19:05:41+00:00 2015-01-15 19:31:00+00:00 1 3.60 -73.976601 40.751896 -73.998924 40.714596 17.5 ... SHANGHAI ASIAN MANOR MANHATTAN 21 MOTT STREET 10013.0 Chinese 21 MOTT STREET NY 10013 21 Mott St, New York, NY 10013, USA 40.714407 -73.999033
3 2 2015-01-15 19:05:41+00:00 2015-01-15 19:10:22+00:00 1 0.89 -73.994957 40.745079 -73.999939 40.734650 5.5 ... QUALITY EATS MANHATTAN 19 GREENWICH AVE 10014.0 American 19 GREENWICH AVE NY 10014 19 Greenwich Ave, New York, NY 10014, USA 40.734502 -74.000055
4 2 2015-01-15 19:05:41+00:00 2015-01-15 19:10:55+00:00 1 0.96 -74.000938 40.747063 -74.003563 40.735512 5.5 ... SANT AMBROEUS MANHATTAN 259 WEST 4 STREET 10014.0 Italian 259 WEST 4 STREET NY 10014 259 West 4th Street, New York, NY 10014, USA 40.735426 -74.003141

5 rows × 30 columns

5. Exploratory Visualization ¶

In [73]:
# px.set_mapbox_access_token(open("./secret/mapbox_token.txt").read())
In [57]:
fig = px.scatter_mapbox(nyctrips_meal_rest,
                        lat="dropoff_latitude", lon="dropoff_longitude",
                        color="pickup_hour",
                        size='trip_distance',
                        mapbox_style='stamen-terrain',
                        color_continuous_scale=px.colors.cyclical.IceFire,
                        hover_data=['trip_distance', 'passenger_count', 'fare_amount', 'BORO', 'DBA', 'CUISINE DESCRIPTION'],
                        height=400, zoom=12)

fig.update_layout(margin={"r":5,"t":5,"l":5,"b":5})
fig.show()

Interactive scatterplot visualization showing taxi trips to restaurants in New York during lunch and dinner time

  • Hover over the bubbles to view more information
  • Size of the bubble indicates trip distance in miles
  • Color of the bubble indicates the hour of the day (24-hr format) when the passenger was picked up by the taxi
  • Taxi Trip data sourced from NYC TLC
  • Restaurant Info data sourced from NYC Open Data

6. EDA Questions ¶

6.A Trip Distance ¶

a. How far do people travel based on different types of cuisine (“CUISINE DESCRIPTION”)? How does this differ based on the borough where the restaurant is located (“BORO”, one of 5 large NYC neighborhoods)? How does this differ by meal time?

In [58]:
pd.set_option('display.max_rows', 500)
In [59]:
# Travel times by one variable
nyctrips_meal_rest.groupby(['CUISINE DESCRIPTION']).agg({'trip_distance': ['mean', 'median', 'std', 'min', 'max', 'count']})
Out[59]:
trip_distance
mean median std min max count
CUISINE DESCRIPTION
Afghan 2.385000 2.385 0.601041 1.96 2.81 2
African 2.666667 1.700 2.376010 0.63 9.15 27
American 2.429653 1.500 3.007018 0.00 28.98 8118
Armenian 2.581429 1.760 1.710982 1.00 6.55 14
Asian 2.701055 1.690 3.311389 0.30 21.11 199
Australian 2.270000 1.430 2.188100 0.34 6.26 11
Bagels/Pretzels 2.298516 1.500 2.493604 0.00 19.30 155
Bakery 2.260299 1.400 2.765800 0.00 21.50 334
Bangladeshi 3.245000 1.785 4.408369 0.44 15.40 10
Barbecue 2.184722 1.290 2.639786 0.00 19.28 72
Bottled beverages, including water, sodas, juices, etc. 1.757931 1.200 1.424034 0.40 6.50 29
Brazilian 2.263462 1.895 1.744617 0.30 8.70 26
Café/Coffee/Tea 2.350138 1.430 2.995058 0.00 22.04 2399
Cajun 0.650000 0.650 0.070711 0.60 0.70 2
Californian 0.812500 0.880 0.215310 0.50 0.99 4
Caribbean 4.323889 3.900 3.152185 0.35 13.18 36
Chicken 2.709403 1.775 3.080042 0.20 19.50 134
Chilean 2.525000 2.085 1.899839 0.52 5.66 6
Chinese 2.464896 1.595 2.901235 0.00 20.40 576
Chinese/Cuban 1.767419 1.450 1.119559 0.46 5.28 31
Chinese/Japanese 2.232157 1.700 2.009778 0.20 12.00 51
Continental 2.283390 1.350 3.244220 0.33 17.94 59
Creole 14.445000 14.445 14.630039 4.10 24.79 2
Delicatessen 2.342610 1.420 2.897948 0.00 20.94 433
Donuts 2.216626 1.510 2.466601 0.20 19.86 406
Eastern European 2.656410 1.470 3.684707 0.00 17.70 39
Egyptian 3.156667 1.150 4.585990 0.60 12.37 6
English 2.161515 1.510 2.442808 0.50 11.84 33
Ethiopian 2.830556 2.600 2.266278 1.00 10.77 18
Filipino 3.566667 1.945 3.588469 0.70 9.35 6
French 2.037975 1.350 2.469726 0.00 20.83 894
German 2.466531 1.860 1.796269 0.40 7.30 49
Greek 2.217661 1.500 1.997133 0.15 12.35 124
Hamburgers 2.171037 1.600 2.185650 0.20 19.00 241
Hawaiian 2.387500 1.945 1.671543 0.90 6.20 12
Hotdogs 1.846047 1.600 1.633760 0.30 10.35 43
Hotdogs/Pretzels 2.249167 1.395 3.072190 0.52 18.42 48
Ice Cream, Gelato, Yogurt, Ices 2.016991 1.400 2.094419 0.20 17.74 216
Indian 2.287871 1.420 2.994588 0.00 19.71 202
Indonesian 2.346000 1.750 2.024618 0.70 5.68 5
Iranian 1.290000 1.290 NaN 1.29 1.29 1
Irish 1.992092 1.340 2.395222 0.00 18.31 325
Italian 2.222564 1.400 2.680227 0.00 28.50 1650
Japanese 2.194248 1.490 2.659043 0.00 19.70 911
Jewish/Kosher 2.338398 1.600 2.755152 0.20 18.07 206
Juice, Smoothies, Fruit Salads 2.150811 1.500 2.360630 0.00 18.40 259
Korean 2.650000 1.430 3.454270 0.29 17.80 135
Latin (Cuban, Dominican, Puerto Rican, South & Central American) 2.585144 1.690 2.907344 0.00 23.40 278
Mediterranean 1.925054 1.400 2.174275 0.00 20.40 279
Mexican 2.506058 1.500 3.078704 0.00 33.30 520
Middle Eastern 2.474492 1.705 2.336887 0.36 14.95 118
Moroccan 1.791429 1.475 1.041586 0.20 5.10 28
Nuts/Confectionary 1.081250 1.190 0.539615 0.10 1.81 8
Other 2.106232 1.400 2.422389 0.00 22.40 775
Pakistani 3.610000 1.320 6.091768 0.68 17.40 7
Pancakes/Waffles 5.010000 5.010 2.602153 3.17 6.85 2
Peruvian 3.165714 1.500 4.150705 0.30 12.70 14
Pizza 2.405843 1.555 2.715260 0.00 19.85 676
Pizza/Italian 2.005352 1.380 2.435122 0.00 19.52 454
Polish 2.697500 2.895 1.391219 0.90 4.10 4
Portuguese 1.962500 2.125 1.227457 0.50 4.10 8
Russian 3.184737 1.790 4.793437 0.39 17.70 19
Salads 2.294785 1.610 2.227556 0.20 17.30 163
Sandwiches 2.009944 1.500 1.899215 0.00 17.35 357
Sandwiches/Salads/Mixed Buffet 3.173097 1.525 3.702273 0.00 18.55 494
Scandinavian 2.718333 1.700 2.163797 0.53 5.94 6
Seafood 2.368639 1.550 2.869940 0.00 17.96 169
Soul Food 5.387778 2.600 5.721577 0.80 18.35 18
Soups & Sandwiches 2.051159 1.300 2.476305 0.30 17.31 69
Southwestern 1.828333 1.565 1.044651 0.94 3.90 6
Spanish 2.432581 1.660 2.092671 0.31 11.94 155
Steak 2.383605 1.400 3.338558 0.14 21.83 172
Tapas 1.967708 1.365 1.770920 0.50 10.90 48
Tex-Mex 2.262111 1.515 2.144468 0.30 12.60 90
Thai 2.445107 1.600 2.641388 0.10 19.14 233
Turkish 2.128971 1.295 2.441262 0.20 11.53 68
Vegetarian 1.875169 1.400 2.088473 0.30 18.20 89
Vietnamese/Cambodian/Malaysia 1.983871 1.150 2.283297 0.50 11.10 31
In [60]:
# plotting median trip_distance for all cuisines
eda_traveltime_cuisine = nyctrips_meal_rest.groupby(['CUISINE DESCRIPTION']).agg({'trip_distance': 'median', 'passenger_count': 'count'}).reset_index()
px.bar(eda_traveltime_cuisine[eda_traveltime_cuisine['passenger_count']>=10], x='CUISINE DESCRIPTION', y='trip_distance',
      title='Median travel distance (miles) by cuisine types (Cuisines >=10 trips)',
      height=500)
In [61]:
# plotting median trip_distance for all cuisines
"{} out of the total {} cuisine types, or about {}% cuisine types, have a median taxi trip distance under 2 miles".format((eda_traveltime_cuisine['trip_distance']<=2).sum(), eda_traveltime_cuisine.shape[0], round((eda_traveltime_cuisine['trip_distance']<=2).sum()/eda_traveltime_cuisine.shape[0] * 100),0)
Out[61]:
'69 out of the total 78 cuisine types, or about 88% cuisine types, have a median taxi trip distance under 2 miles'
In [62]:
# Travel times by borough
nyctrips_meal_rest.groupby(['BORO']).agg({'trip_distance': ['mean', 'median', 'std', 'min', 'max', 'count']})
Out[62]:
trip_distance
mean median std min max count
BORO
BRONX 10.529565 11.10 5.697970 1.1 20.80 23
BROOKLYN 5.215671 4.18 4.104402 0.0 23.58 663
MANHATTAN 2.165895 1.41 2.608664 0.0 33.30 22770
QUEENS 6.676768 6.10 4.494873 0.0 24.79 461
In [63]:
# plotting median trip_distance for all boroughs
eda_traveltime_boro = nyctrips_meal_rest.groupby(['BORO']).agg({'trip_distance': 'median'}).reset_index()
px.bar(eda_traveltime_boro, x='BORO', y='trip_distance',
      title='Median taxi trip distance (miles) in 5 boroughs of New York')
In [64]:
# Travel time by meal
nyctrips_meal_rest.groupby(['meal']).agg({'trip_distance': ['mean', 'median', 'std', 'min', 'max', 'count']})
Out[64]:
trip_distance
mean median std min max count
meal
dinnertime 2.441954 1.5 2.985325 0.0 33.3 15642
lunchtime 2.162967 1.4 2.523152 0.0 23.4 8275

Summary:
Looking at the various descriptive statistics for trip_distance, we observe that:

  • Most cuisine types (88% of the total 78 types) see a median trip distance of under 2 miles
  • People travel the farthest for the following cuisines: Caribbean (3.9 miles), Ethiopian (2.6 miles) and Soul Food (2.6 miles)
  • Bronx has the highest median travel distance (11 miles) across the 5 boroughs, with the least number of trips 23. Manhattan has the most trips, about 23K, with the least median travel distance (1.4 miles)
  • Travel distances at dinner times are slightly longer than lunch times, median distances of 1.5 miles vs 1.4 miles for dinner and lunch, respectively

6.B Tipping Rate ¶

b. What is the average tipping rate for different types of cuisine? How does this differ by borough, by meal time, and by number of passengers in the taxi?

In [65]:
nyctrips_meal_rest.groupby(['CUISINE DESCRIPTION']).agg({'tip_amount': 'mean'}).reset_index()
Out[65]:
CUISINE DESCRIPTION tip_amount
0 Afghan 2.750000
1 African 1.514815
2 American 1.618282
3 Armenian 1.599286
4 Asian 1.658794
5 Australian 0.709091
6 Bagels/Pretzels 1.701419
7 Bakery 1.384401
8 Bangladeshi 2.427000
9 Barbecue 1.564167
10 Bottled beverages, including water, sodas, jui... 1.372414
11 Brazilian 0.959615
12 Café/Coffee/Tea 1.585632
13 Cajun 0.750000
14 Californian 1.412500
15 Caribbean 1.816389
16 Chicken 1.324552
17 Chilean 1.600000
18 Chinese 1.449253
19 Chinese/Cuban 1.186129
20 Chinese/Japanese 1.580392
21 Continental 1.283559
22 Creole 0.000000
23 Delicatessen 1.502887
24 Donuts 1.672143
25 Eastern European 2.031795
26 Egyptian 1.635000
27 English 1.413333
28 Ethiopian 2.221111
29 Filipino 1.300000
30 French 1.440134
31 German 1.555102
32 Greek 1.500565
33 Hamburgers 1.655353
34 Hawaiian 1.911667
35 Hotdogs 1.346279
36 Hotdogs/Pretzels 1.628125
37 Ice Cream, Gelato, Yogurt, Ices 1.444537
38 Indian 1.481881
39 Indonesian 1.600000
40 Iranian 0.000000
41 Irish 1.410277
42 Italian 1.470279
43 Japanese 1.534720
44 Jewish/Kosher 1.479612
45 Juice, Smoothies, Fruit Salads 1.523398
46 Korean 1.422148
47 Latin (Cuban, Dominican, Puerto Rican, South &... 1.595108
48 Mediterranean 1.520430
49 Mexican 1.619250
50 Middle Eastern 1.323051
51 Moroccan 1.240714
52 Nuts/Confectionary 1.125000
53 Other 1.368942
54 Pakistani 2.797143
55 Pancakes/Waffles 2.750000
56 Peruvian 1.055714
57 Pizza 1.513373
58 Pizza/Italian 1.515264
59 Polish 3.175000
60 Portuguese 1.645000
61 Russian 1.952632
62 Salads 1.588589
63 Sandwiches 1.396639
64 Sandwiches/Salads/Mixed Buffet 2.183745
65 Scandinavian 1.025000
66 Seafood 1.734911
67 Soul Food 3.491111
68 Soups & Sandwiches 1.200000
69 Southwestern 1.258333
70 Spanish 1.511484
71 Steak 1.569070
72 Tapas 1.395625
73 Tex-Mex 1.679222
74 Thai 1.696395
75 Turkish 1.655882
76 Vegetarian 1.664607
77 Vietnamese/Cambodian/Malaysia 1.166774
In [66]:
eda_tip_cuisine = nyctrips_meal_rest.groupby(['CUISINE DESCRIPTION']).agg({'tip_amount': 'mean', 'passenger_count': 'count'}).reset_index()
px.bar(eda_tip_cuisine[eda_tip_cuisine['passenger_count']>=10], x='CUISINE DESCRIPTION', y='tip_amount',
      title='Average tip amount (USD) by cuisine types (Cuisines >=10 trips)',
      height=500)
In [67]:
nyctrips_meal_rest.groupby(['BORO']).agg({'tip_amount': 'mean'}).reset_index()
Out[67]:
BORO tip_amount
0 BRONX 2.731304
1 BROOKLYN 2.707526
2 MANHATTAN 1.489329
3 QUEENS 3.675944
In [68]:
eda_traveltime_boro = nyctrips_meal_rest.groupby(['BORO']).agg({'tip_amount': 'mean'}).reset_index()
px.bar(eda_traveltime_boro, x='BORO', y='tip_amount',
      title='Average tip amount (USD) in 5 boroughs of New York')
In [69]:
nyctrips_meal_rest.groupby(['meal']).agg({'tip_amount': 'mean'}).reset_index()
Out[69]:
meal tip_amount
0 dinnertime 1.615440
1 lunchtime 1.473816
In [70]:
nyctrips_meal_rest.groupby(['passenger_count']).agg({'tip_amount': 'mean'}).reset_index()
Out[70]:
passenger_count tip_amount
0 0 1.394444
1 1 1.585499
2 2 1.524652
3 3 1.473877
4 4 1.466503
5 5 1.534502
6 6 1.527564
In [71]:
eda_traveltime_boro = nyctrips_meal_rest.groupby(['passenger_count']).agg({'tip_amount': 'mean'}).reset_index()
px.line(eda_traveltime_boro, x='passenger_count', y='tip_amount',
       title='Average tip amount (USD) across passenger count in a taxi trip')

Summary:

  • Taxi tipping amount varies across restaurants' cuisine types. People going to Soul Food restaurants tend to tip the highest on average (USD 3.5), followed by Bangladeshi (USD 2.4) and Ethiopian (USD 2.2) cuisines
  • Across the 5 boroughs in New York, people going to restaurants in Queens tip their taxi drivers the highest amount on average (USD 3.7), and taxi rides ending in restaurants in Manhattan tend to receive the least tips on average per trip (USD 1.5)
  • People tend to tip slightly higher during dinner time meals than lunch times (USD 1.6 vs 1.5)
  • Tipping amount is maximum for a single passenger (USD 1.6) and then decreases linearly as the number of passengers increase till 4 people. After 5 or more passengers, the average tipping amount increases again (USD 1.5), possibly due to the taxi being a bigger car.

7. Predictive Model ¶

Describe how you would set up a predictive model of restaurant cuisine type to be visited by a taxi rider based on information present in the two datasets

In order to build a model to determine the restaurant cuisine type visited by a taxi rider, we should start by answering the question - Where in the journey of a taxi ride would this model be utilized? A model that gives a prediction of the restaurant cuisine at the beginning of the taxi trip is different than a model that predicts the cuisine type after the trip is over.

Application of the predictive model

The former model would be useful for the taxi cab company. For example, at the beginning of the journey, knowing the predicted restaurant cuisine type would help the taxi company tailor the ride experience of the passenger(s). They might offer customized recommendations from the menu, information such as nutrition value, ability to do table reservations, offer discounts, show promotions and display any event information at the restaurant venue. This would enhance the ride experience of the passengers before they reach the restaurant. Furthermore, building such a predictive model would have access to a limited number of input features. The actual values of the trip distance, trip time, total fare, tip amount, exact dropoff coordinates would be unknown during the trip and would have to be estimated.

In contrast to this, the latter model which predicts the restaurant cuisine type after the taxi trip is over would be useful to analyze the movement of people from a macro lens. This post facto analysis might uncover food consumption patterns of masses from their movement data. Having the capability of predicting the cuisine type that people consume based on their journeys can inform decisions such opening a new restaurant or meal delivery service in a new region. This can be done by correlating taxi ride patterns between two regions - one where model was trained, and another region where a new restaurant/meal delivery service is to be started.

Model specification

For the task, we can build a classification model, specifically a multi-class classification model. Using the historical information present in the NYC taxi trips and the NYC restaurant inspections datasets as the independent input variables, the multi-class classification model would be able to predict a value for the dependent target variable, restaurant cuisine type, for a given set of inputs.

Predictive Model: Multi-class Classification (or Multi-label if we want predictions of multiple possible cuisine types. We might be able to uncover insights from cuisine types that are most frequently predicted together)
Input: Features such as taxi trip details (passenger_count, trip_distance, fare_amount, pickup location, pickup time and more. A subset of most important features to be identified via feature importance techniques
Target: Restaurant Cuisine Type (78 classes or cuisine types)
Model(s): Logistic Regression, Decision Trees, Random Forest, Support Vector Machines, Neural Networks

Additional data: Weather, temperature, holidays and events also affect what cuisine people choose to eat. In addition, we can get additional restaurant attributes such as ratings and reviews, price range. Presence of such data may help increase the predictive model's performance.

Building and evaluating the predictive model

  1. Data pre-processing: Although minimal pre-processing was done in this notebook (such as removing Null values from restaurant addresses, removing trips where the fare amount was negative), thorough cleaning of input data would be required before it is ready for the model. This involves treating missing values in all columns, outliers in numerical columns like trip distance, fare amount, convert categorical features like pickup_hour to numerical, although some models like Random Forest can handle categorical columns too.

  2. Treating Class Imbalance: Out of 78 cuisine types, only 68 cuisine types have more than 10 taxi trips, and only 34 have more than 100 taxi trips. Although these low number of data points can be improved by acquiring taxi trip data for more days (we only have data for 1 day in this notebook). There would still be some cuisine types that are over-represented in the training data and some that would be under-represented. Possible solutions to this problem are methods such as SMOTE (Synthetic Minority Over-sampling Technique) to create synthetic data samples for under-represented cuisines, or merging similar cuisine types.

  3. Data Splitting: Splitting data into training, validation and test sets. Training set would be used to train the model, validation to tune the hyperparameters and test set to evaluate the model performance on unseen data.

  4. Training and Hyperparameter tuning: Training would be performed on the training set, and then tuning model's hyperparameters using the validation set. The hyperparameters for different types of models can include the regularization parameter, max depth of tree, learning rate, number of hidden layers, and number of neurons per layer

  5. Evaluating the model: Finally, we would evaluate the model's performance on the test set using metrics such as accuracy, precision, recall, and F1 score. These metrics will provide insights into how well the model is predicting the cuisine type based on taxi ride information. Based on the model evaluation results, we can choose the best performing model or create an ensemble of models for the prediction.